[WIP] [skip ci] Fuzz testing in Spark SQL #7625
Test build #38263 has finished for PR 7625 at commit
I just pushed a commit which adds some randomized tests of the DataFrame API, and it appears to have uncovered runtime crashes for some simple queries. I'm going to investigate and try to find deterministic minimal reproductions.
Test build #38283 has finished for PR 7625 at commit
Test build #38289 has finished for PR 7625 at commit
Test build #38302 has finished for PR 7625 at commit
Test build #41000 has finished for PR 7625 at commit
Test build #63813 has finished for PR 7625 at commit
Test build #63815 has finished for PR 7625 at commit
[skip ci]
This is a WIP pull request for some expression fuzz testing code that I'm working on as part of a hackathon. I'm creating this pull request now in order to share the code and to have a pull request that I can reference from my other pull requests for fixing bugs that were found using this tester.
Features on my TODO list
List of potential bugs found during this testing
Note that most of these bugs are problems in analysis error reporting and not legitimate bugs in query execution. This tool isn't really capable of finding "wrong answer" bugs yet because it lacks an oracle for determining what the proper query answers are.
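For illustration only, a crash-only fuzz loop of the kind described in this note might look like the sketch below. It is plain Python with a toy two-operator evaluator, not the code in this pull request, and `random_expr`, `evaluate`, `classify`, and `run_fuzz` are hypothetical names. Without an oracle, the loop can only sort outcomes into "ok", "analysis_error" (a clean rejection), and "crash" (any other exception); it cannot tell whether an "ok" result is the right answer. The random generator is seeded so that any crash can be reproduced deterministically.

```python
import random


class AnalysisError(Exception):
    """Stands in for a clean, user-facing rejection of an invalid expression."""


def random_expr(rng, depth=0):
    """Build a random expression tree over a tiny, made-up grammar."""
    if depth >= 3 or rng.random() < 0.3:
        return rng.choice([0, 1, -1, 2**31 - 1, None, "a"])
    op = rng.choice(["add", "div"])
    return (op, random_expr(rng, depth + 1), random_expr(rng, depth + 1))


def evaluate(expr):
    """Toy evaluator: type errors are reported cleanly, other failures are not."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    lhs, rhs = evaluate(left), evaluate(right)
    if type(lhs) is not int or type(rhs) is not int:
        raise AnalysisError(f"{op} expects two ints, got {lhs!r} and {rhs!r}")
    if op == "add":
        return lhs + rhs
    return lhs // rhs  # deliberate gap: a zero divisor is never checked


def classify(expr):
    """Without an oracle, only the kind of outcome can be judged."""
    try:
        evaluate(expr)
        return "ok"
    except AnalysisError:
        return "analysis_error"
    except Exception:
        return "crash"


def run_fuzz(seed, iterations):
    """Run a seeded fuzz session and return (outcome counts, crashing expressions)."""
    rng = random.Random(seed)
    counts = {"ok": 0, "analysis_error": 0, "crash": 0}
    crashes = []
    for _ in range(iterations):
        expr = random_expr(rng)
        outcome = classify(expr)
        counts[outcome] += 1
        if outcome == "crash":
            crashes.append(expr)
    return counts, crashes


counts, crashes = run_fuzz(seed=0, iterations=200)
print(counts)
```

Re-running `run_fuzz` with the same seed regenerates the same expressions, which is one simple way to turn a random failure into a repeatable one before minimizing it.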
(✅ indicates fixed, 🚧 indicates a fix in progress)
Analysis issues:
- The `createDataFrame()` methods should guard against `null` values being passed in (e.g. the user passes `null` instead of `Row`).
- ✅ The analyzer should check that join conditions have BooleanType: [SPARK-9292] Analysis should check that join conditions' data types are BooleanType #7630.
- ✅ The analyzer should ensure that set operations (union, intersect, and except) are only performed on tables that have the same number of columns: [SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columns #7631
- ✅ Sorting based on array-typed columns should print an error at analysis time, not runtime. [SPARK-9295] Analysis should detect sorting on unsupported column types #7633
- ✅ DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns: https://issues.apache.org/jira/browse/SPARK-9323
- The `DATAFRAME_EAGER_ANALYSIS` configuration flag does not work properly in all cases: there are still many corner-cases where invalid queries will eagerly throw analysis errors.
- Type mismatches in joins are sometimes confusing. Let's say that we have two RDDs with columns that have the same name, but where one column is a struct and another is a boolean. If we try to join on a nested field then this can result in a confusing "Can't extract value" message instead of a more informative message that explains that the types are mismatched:
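To make several of the checks in this list concrete, here is a minimal sketch in plain Python. It is not Spark's analyzer and not a reproduction from this pull request: schemas are modeled as hypothetical `dict`s from column name to a type string, and `check_rows`, `check_join_condition`, `check_set_operation`, and `check_sort_columns` are illustrative names. The point is only that each invalid query is rejected up front with a specific message instead of failing later at runtime.

```python
class AnalysisError(Exception):
    """Raised when a query is rejected before execution."""


def check_rows(rows):
    """Reject a null collection, or null entries in it, before building a frame."""
    if rows is None:
        raise ValueError("rows must not be None")
    for i, row in enumerate(rows):
        if row is None:
            raise ValueError(f"row {i} is None")


def check_join_condition(condition_type):
    """A join condition must evaluate to a boolean."""
    if condition_type != "boolean":
        raise AnalysisError(
            f"join condition must be boolean, got {condition_type}")


def check_set_operation(op, left_schema, right_schema):
    """Union, intersect, and except need the same number of columns on both sides."""
    if len(left_schema) != len(right_schema):
        raise AnalysisError(
            f"{op} needs equal column counts, got "
            f"{len(left_schema)} and {len(right_schema)}")


def check_sort_columns(schema, sort_columns):
    """Reject sorting on array-typed columns at analysis time."""
    for name in sort_columns:
        if name not in schema:
            raise AnalysisError(f"unknown column {name}")
        if schema[name].startswith("array"):
            raise AnalysisError(
                f"cannot sort on column {name} of type {schema[name]}")


left = {"id": "int", "tags": "array<string>"}
right = {"id": "int"}

checks = [
    lambda: check_join_condition("int"),
    lambda: check_set_operation("union", left, right),
    lambda: check_sort_columns(left, ["tags"]),
]
for check in checks:
    try:
        check()
    except AnalysisError as err:
        print("analysis error:", err)
```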
Execution issues:
- `array<binary>` columns.
Expression issues:
- This is caused by extreme array sizes which overflow intmax.
- `UTF8String.repeat` can throw `NegativeArraySizeException` when applied to random bytes which have been cast to a string.
- `UTF8String.reverse` can throw `ArrayIndexOutOfBoundsException` when applied to random bytes which have been cast to a string.
- ✅ The methods in the `Unevaluable` trait should be final and some of the new aggregate functions should inherit from this trait ([SPARK-9286] [SQL] Methods in Unevaluable should be final and AlgebraicAggregate should extend Unevaluable. #7627).
- For extremely small inputs, the results of the Remainder expression can differ in the codegen and non-codegen paths:
  This is most likely a numeric stability issue.
- Code generation frequently crashes for expressions containing null literals, but this isn't a problem that will impact users due to our codegen fallback path.
- ✅ `NaNvl` should check that its two arguments are of the same floating point type: [SPARK-9549][SQL] fix bugs in expressions #7882
- ✅ Code-generated numeric comparison expressions may fail to compile for Boolean types: [SPARK-9549][SQL] fix bugs in expressions #7882
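The codegen versus non-codegen discrepancy for Remainder suggests one oracle that needs no known-correct answers: run the same expression through two evaluation paths and flag disagreements. The sketch below shows that idea in plain Python. The two "paths" are stand-ins chosen for illustration (`math.fmod` and the algebraically equivalent `a - b * trunc(a / b)`), not Spark's actual code paths, and `find_mismatches` is a hypothetical helper. It makes no claim about the real cause of the Spark discrepancy; it only shows how two equivalent formulas can disagree when a divisor is very small.

```python
import math


def remainder_reference(a, b):
    """Path A: the platform's fmod."""
    return math.fmod(a, b)


def remainder_naive(a, b):
    """Path B: a - b * trunc(a / b); the division and product each round."""
    return a - b * math.trunc(a / b)


def find_mismatches(path_a, path_b, inputs, rel_tol=1e-9):
    """Return (a, b, result_a, result_b) for inputs where the paths disagree."""
    mismatches = []
    for a, b in inputs:
        x, y = path_a(a, b), path_b(a, b)
        # Relative tolerance only: an absolute tolerance would hide
        # disagreements between very small results.
        if not math.isclose(x, y, rel_tol=rel_tol):
            mismatches.append((a, b, x, y))
    return mismatches


inputs = [(7.5, 2.0), (1.0, 1e-20)]
for a, b, x, y in find_mismatches(remainder_reference, remainder_naive, inputs):
    print(f"remainder({a}, {b}): path A = {x!r}, path B = {y!r}")
```

A check like this still cannot say which path is right, only that the two disagree, so each reported mismatch needs manual triage.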
Minor UX issues:
- The ORC writer could log a more informative error message when the user isn't using a HiveSQLContext:
- Confusing unresolved alias errors are thrown somewhat later in analysis than I'd expect. Ideally we would never see `UnresolvedException: Invalid call to dataType on unresolved object`, since we would have checked for resolution before inspecting the data types. `dropDuplicates` seems especially prone to this problem.